Maximum Likelihood Principle

The "useless" Bayesian view

The goal of maximum likelihood is to fit a distribution to some data.
Using Bayes' theorem, we want to find the most likely value for the parameters of our model, given the data.

$$\operatorname{argmax}_{\theta} P(\theta|X) = \operatorname{argmax}_{\theta} \frac{P(X|\theta)\,P(\theta)}{P(X)}$$

Where:

  • P(θ) is called the prior
  • P(X|θ) is called the likelihood, which is not really a probability
  • P(θ|X) is called the posterior

The likelihood is equal to the probability density function of a Gaussian, if we assume that the data was generated by a Gaussian distribution.
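To make the two readings of the same formula concrete, here's a minimal sketch of a 1-D Gaussian density (the function name and values are hypothetical, not from the notes):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x.

    Read with x held fixed and (mu, sigma) varying, this
    same number is the *likelihood* of the parameters."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

print(gaussian_pdf(0.0, 0.0, 1.0))  # peak of a standard normal, ~0.399
```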

Now the real stuff lol

We basically brute-force fit a Gaussian distribution on the data and keep the one that maximizes the likelihood function.

So we begin by fitting the Gaussians; we start with μ=28 and σ=2:

Which yields a likelihood of:
$$L(\mu,\Sigma;x) = \frac{1}{(2\pi)^{D/2}}\,\frac{1}{|\Sigma|^{1/2}}\,\exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right) = 0.03$$

But we can do better, right?
In fact, if we plug in μ=30 and σ=2:

We get a likelihood of 0.12, which is definitely better!
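The notes don't say what the observed data point is; x = 32 is a guess that happens to reproduce the quoted 0.03 and 0.12, using a plain 1-D Gaussian density as the likelihood:

```python
import math

def gaussian_pdf(x, mu, sigma):
    # 1-D Gaussian density, read here as a likelihood in (mu, sigma).
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

x = 32.0  # hypothetical observation, not stated in the notes
print(round(gaussian_pdf(x, 28.0, 2.0), 2))  # 0.03
print(round(gaussian_pdf(x, 30.0, 2.0), 2))  # 0.12
```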

By the way, if we plot the likelihood over all the possible values of μ, we can actually see that we get the maximum likelihood where the derivative of the whole thing is 0:
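That brute-force picture can be sketched as a grid search over μ (same hypothetical x = 32 and σ = 2 as above):

```python
import math

def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

x, sigma = 32.0, 2.0                        # hypothetical observation, fixed sigma
mus = [28.0 + k / 10.0 for k in range(41)]  # candidate means from 28 to 32
best_mu = max(mus, key=lambda mu: gaussian_pdf(x, mu, sigma))
print(best_mu)  # 32.0 -- for a single point the likelihood peaks at mu = x
```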

If we have multiple data points, the likelihood function will be the product of all the individual Gaussian likelihood functions generated from the data points.

$$p(\mathcal{D};\mu,\Sigma) = p(\{x_1,\ldots,x_N\};\mu,\Sigma) = p(x_1;\mu,\Sigma)\,p(x_2;\mu,\Sigma)\cdots p(x_N;\mu,\Sigma)$$
$$p(\mathcal{D};\mu,\Sigma) = \prod_{i=1}^{N} p(x_i;\mu,\Sigma) = \prod_{i=1}^{N}\mathcal{N}(x_i;\mu,\Sigma)$$
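As a sketch (with made-up measurements), the product above in code, plus the log version that is usually maximized in practice to avoid numerical underflow:

```python
import math

def gaussian_pdf(x, mu, sigma):
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

data = [27.1, 29.4, 31.8, 30.2, 28.9]  # hypothetical measurements

def likelihood(data, mu, sigma):
    # i.i.d. assumption: the joint likelihood is the product of densities.
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, mu, sigma)
    return result

def log_likelihood(data, mu, sigma):
    # log turns the product into a sum; the argmax is unchanged.
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in data)

print(likelihood(data, 29.48, 2.0))
```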

So we take the derivative of this with respect to μ, set it to 0, and we actually find the maximum likelihood parameters.

Obviously, you can do the same to find the standard deviation. You lock in μ and let the standard deviation vary, then you stick with the value that gives the maximum likelihood:

Tldr

In order to get the maximum likelihood parameters for multiple data points, we multiply all the individual likelihood functions, take the derivative of the product, set it to 0, and solve for μ and σ.
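Putting the tl;dr into code: setting those derivatives to zero yields the closed-form estimates below (the sample mean, and a standard deviation with a 1/N divisor). The data here is made up for illustration:

```python
import math

data = [27.1, 29.4, 31.8, 30.2, 28.9]  # hypothetical measurements
n = len(data)

mu_hat = sum(data) / n  # MLE for the mean: just the sample mean
# MLE for sigma divides by N, not the N-1 of the unbiased sample estimator.
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

print(mu_hat, sigma_hat)  # 29.48 and roughly 1.54
```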

After endless mathematical proofs... the end

There is a whole proof to justify that the maximum likelihood estimate of μ is equal to the mean of the measurements. The proof is covered in the link below:
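The key step of that proof, sketched: differentiate the log-likelihood with respect to μ and set it to zero (only the μ-dependent term of each summand survives):

$$\frac{\partial}{\partial\mu}\log L = \sum_{i=1}^{N}\frac{x_i-\mu}{\sigma^2} = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{N}\sum_{i=1}^{N}x_i$$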

Similarly, there is a proof that shows that the width of the distribution is equal to the standard deviation of the measurements:
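The corresponding step for σ, sketched: differentiate the log-likelihood with respect to σ and set it to zero:

$$\frac{\partial}{\partial\sigma}\log L = \sum_{i=1}^{N}\left(-\frac{1}{\sigma}+\frac{(x_i-\hat{\mu})^2}{\sigma^3}\right) = 0 \quad\Rightarrow\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\hat{\mu})^2$$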

These notions may be obvious, but now we have the math to back it up.
